
    Manual Annotation of Translational Equivalence: The Blinker Project

    Bilingual annotators were paid to link roughly sixteen thousand corresponding words between on-line versions of the Bible in modern French and modern English. These annotations are freely available to the research community from http://www.cis.upenn.edu/~melamed. The annotations can be used for several purposes. First, they can be used as a standard data set for developing and testing translation lexicons and statistical translation models. Second, researchers in lexical semantics will be able to mine the annotations for insights about cross-linguistic lexicalization patterns. Third, the annotations can be used in research into certain recently proposed methods for monolingual word-sense disambiguation. This paper describes the annotated texts, the specially designed annotation tool, and the strategies employed to increase the consistency of the annotations. The annotation process was repeated five times by different annotators. Inter-annotator agreement rates indicate that the annotations are reasonably reliable and that the method is easy to replicate.
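
    As a concrete illustration of the kind of agreement measurement mentioned above, the sketch below compares two annotators' link sets for one verse pair, where each link is a (French word index, English word index) pair. The Jaccard-style measure and the toy annotations are assumptions for illustration, not the Blinker project's exact protocol.

    # Hedged sketch: pairwise agreement between two annotators' word links.
    def pairwise_agreement(links_a, links_b):
        """Fraction of links proposed by either annotator that both proposed."""
        union = links_a | links_b
        if not union:
            return 1.0  # both annotators left the segment unlinked
        return len(links_a & links_b) / len(union)

    # Hypothetical annotations of a short verse (word-index pairs).
    annotator_1 = {(0, 0), (1, 2), (2, 1)}
    annotator_2 = {(0, 0), (1, 2), (3, 3)}
    print(pairwise_agreement(annotator_1, annotator_2))  # 0.5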

    Automatic Discovery of Non-Compositional Compounds in Parallel Data

    Automatic segmentation of text into minimal content-bearing units is an unsolved problem even for languages like English. Spaces between words offer an easy first approximation, but this approximation is not good enough for machine translation (MT), where many word sequences are not translated word-for-word. This paper presents an efficient automatic method for discovering sequences of words that are translated as a unit. The method proceeds by comparing pairs of statistical translation models induced from parallel texts in two languages. It can discover hundreds of non-compositional compounds on each iteration, and it constructs longer compounds out of shorter ones. Objective evaluation on a simple machine translation task has shown the method's potential to improve the quality of MT output. The method makes few assumptions about the data, so it can be applied to parallel data other than parallel texts, such as word spellings and pronunciations.
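
    The fuse-and-compare idea can be sketched as follows. The scoring function is left abstract as a placeholder for the translation-model comparison the paper performs; the candidate set, iteration count, and helper names are assumptions for illustration only.

    # Hedged sketch of iterative compound discovery: fuse candidate word
    # pairs into single tokens and keep those that improve a model score.
    def fuse(tokens, compound):
        """Rewrite a token list so the candidate word pair becomes one token."""
        out, i = [], 0
        while i < len(tokens):
            if tuple(tokens[i:i + 2]) == compound:
                out.append("_".join(compound))
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        return out

    def discover_compounds(bitext, candidates, score, iterations=3):
        """Keep candidates whose fusion raises score(bitext); re-fuse the
        corpus so longer compounds can be built out of shorter ones."""
        kept = []
        for _ in range(iterations):
            improved = False
            for cand in sorted(candidates - set(kept)):
                fused = [(fuse(src, cand), tgt) for src, tgt in bitext]
                if score(fused) > score(bitext):
                    kept.append(cand)
                    bitext = fused
                    improved = True
            if not improved:
                break
        return kept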

    Models of Co-occurrence

    A model of co-occurrence in bitext is a boolean predicate that indicates whether a given pair of word tokens co-occur in corresponding regions of the bitext space. Co-occurrence is a precondition for the possibility that two tokens might be mutual translations. Models of co-occurrence are the glue that binds methods for mapping bitext correspondence with methods for estimating translation models into an integrated system for exploiting parallel texts. Different models of co-occurrence are possible, depending on the kind of bitext map that is available, the language-specific information that is available, and the assumptions made about the nature of translational equivalence. Although most statistical translation models are based on models of co-occurrence, modeling co-occurrence correctly is more difficult than it may at first appear.
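
    For concreteness, here is a minimal sketch of one possible co-occurrence predicate, assuming the bitext map is given as a list of aligned segment pairs; distance-based models over a finer-grained bitext map would replace this predicate, and the toy data is purely illustrative.

    # Hedged sketch: segment-based boolean co-occurrence predicate.
    def co_occur(u, v, aligned_segments):
        """True iff word types u and v appear in some pair of corresponding segments."""
        return any(u in src and v in tgt for src, tgt in aligned_segments)

    bitext_map = [
        (["the", "house"], ["la", "maison"]),
        (["a", "dog"], ["un", "chien"]),
    ]
    print(co_occur("house", "maison", bitext_map))  # True
    print(co_occur("house", "chien", bitext_map))   # False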

    Automatic Construction of Clean Broad-Coverage Translation Lexicons

    Word-level translational equivalences can be extracted from parallel texts by surprisingly simple statistical techniques. However, these techniques are easily fooled by indirect associations -- pairs of unrelated words whose statistical properties resemble those of mutual translations. Indirect associations pollute the resulting translation lexicons, drastically reducing their precision. This paper presents an iterative lexicon cleaning method. On each iteration, most of the remaining incorrect lexicon entries are filtered out, without significant degradation in recall. This lexicon cleaning technique can produce translation lexicons with recall and precision both exceeding 90%, as well as dictionary-sized translation lexicons that are over 99% correct.
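
    One simple way to realize such a filtering step is a greedy, competition-based pass over association scores, in which a word pair is accepted only while both of its words are still unclaimed. The scoring numbers and the single-pass greedy rule below are illustrative assumptions, not the paper's exact procedure.

    # Hedged sketch: suppress indirect associations by letting word pairs
    # compete for each other, strongest associations first.
    def clean_lexicon(scores):
        """scores: dict mapping (src_word, tgt_word) -> association strength."""
        lexicon, used_src, used_tgt = {}, set(), set()
        for (s, t), _ in sorted(scores.items(), key=lambda kv: -kv[1]):
            if s not in used_src and t not in used_tgt:
                lexicon[s] = t
                used_src.add(s)
                used_tgt.add(t)
        return lexicon

    toy_scores = {("maison", "house"): 9.1, ("maison", "building"): 3.2,
                  ("chien", "dog"): 8.7, ("chien", "house"): 2.5}
    print(clean_lexicon(toy_scores))  # {'maison': 'house', 'chien': 'dog'}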

    Word-to-Word Models of Translational Equivalence

    Parallel texts (bitexts) have properties that distinguish them from other kinds of parallel data. First, most words translate to only one other word. Second, bitext correspondence is noisy. This article presents methods for biasing statistical translation models to reflect these properties. Analysis of the expected behavior of these biases in the presence of sparse data predicts that they will result in more accurate models. The prediction is confirmed by evaluation with respect to a gold standard -- translation models that are biased in this fashion are significantly more accurate than a baseline knowledge-poor model. This article also shows how a statistical translation model can take advantage of various kinds of pre-existing knowledge that might be available about particular language pairs. Even the simplest kinds of language-specific knowledge, such as the distinction between content words and function words, are shown to reliably boost translation model performance on some tasks. Statistical models that are informed by pre-existing knowledge about the model domain combine the best of both the rationalist and empiricist traditions.
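
    As an example of the simplest kind of language-specific knowledge mentioned above, the sketch below vetoes links whose two words fall into different content/function-word classes; the word lists, the hard veto, and the base score are hypothetical stand-ins rather than the article's actual model.

    # Hedged sketch: a content-word vs. function-word bias on link scores.
    FUNCTION_WORDS = {
        "fr": {"le", "la", "un", "une", "de", "et"},
        "en": {"the", "a", "of", "to", "and"},
    }

    def link_score(src_word, tgt_word, base_score):
        """Keep the base association score only when both words are content
        words or both are function words; otherwise veto the link."""
        src_is_func = src_word in FUNCTION_WORDS["fr"]
        tgt_is_func = tgt_word in FUNCTION_WORDS["en"]
        return base_score if src_is_func == tgt_is_func else 0.0

    print(link_score("maison", "house", 7.5))  # 7.5 (content with content)
    print(link_score("maison", "the", 4.2))    # 0.0 (content with function)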

    Fringe field simulations of a non-scaling FFAG accelerator

    Fixed-field Alternating Gradient (FFAG) accelerators offer the potential of high-quality, moderate-energy ion beams at low cost. Modeling of these structures is challenging with conventional beam tracking codes because of the large radial excursions of the beam and the significance of fringe field effects. Numerous tune resonances are crossed during the acceleration, which would lead to beam instability and loss in a storage ring. In a non-scaling FFAG, the hope is that these resonances can be crossed sufficiently rapidly to prevent beam loss. Simulations are required to see if this is indeed the case. Here we simulate a non-scaling FFAG which accelerates protons from 31 to 250 MeV. We assume only that the bending magnets have mid-plane symmetry, with a specified vertical bending field in the mid-plane (y=0). The magnetic field can be obtained everywhere using a power series expansion, and we develop mathematical tools for calculating this expansion to arbitrary order when the longitudinal field profile is given by an Enge function. We compare the use of a conventional hard-edge fringe with a more accurate, soft-edge fringe field model. The 1/3 tune resonance is the strongest, and crossing it in the hard-edge fringe model results in a 21% loss of the beam. Using the soft-edge fringe model, the beam loss is less than 6%.
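
    The soft-edge fringe model referred to above is based on an Enge-function fall-off of the longitudinal field profile, F(z) = 1 / (1 + exp(sum_n c_n (z/D)^n)), where z is the distance from the effective field boundary and D is the magnet gap. The coefficients and gap in the sketch below are illustrative values, not the ones used in the paper's simulations.

    # Hedged sketch: evaluate an Enge-function fringe profile and compare
    # it with the hard-edge (step-function) approximation.
    import math

    def enge(z, gap, coeffs):
        """Fractional field strength at position z (z > 0 is outside the magnet)."""
        s = z / gap
        poly = sum(c * s**n for n, c in enumerate(coeffs))
        return 1.0 / (1.0 + math.exp(poly))

    def hard_edge(z):
        """Step-function approximation: full field inside, none outside."""
        return 1.0 if z < 0 else 0.0

    coeffs = [0.0, 2.0]           # illustrative linear Enge polynomial
    for z in (-0.10, 0.0, 0.10):  # metres from the field boundary, gap = 5 cm
        print(z, round(enge(z, 0.05, coeffs), 3), hard_edge(z))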

    Embedding multidimensional grids into optimal hypercubes

    Let $G$ and $H$ be graphs with $|V(H)| \geq |V(G)|$, and let $f: V(G) \rightarrow V(H)$ be a one-to-one map of their vertices. Let $\mathrm{dilation}(f) = \max\{ \mathrm{dist}_{H}(f(x), f(y)) : xy \in E(G) \}$, where $\mathrm{dist}_{H}(v,w)$ is the distance between vertices $v$ and $w$ of $H$. Now let $B(G,H) = \min_{f}\{ \mathrm{dilation}(f) \}$ over all such maps $f$. The parameter $B(G,H)$ is a generalization of the classic and well-studied "bandwidth" of $G$, defined as $B(G, P(n))$, where $P(n)$ is the path on $n$ points and $n = |V(G)|$. Let $[a_{1} \times a_{2} \times \cdots \times a_{k}]$ be the $k$-dimensional grid graph with integer values $1$ through $a_{i}$ in the $i$th coordinate. In this paper, we study $B(G,H)$ in the case when $G = [a_{1} \times a_{2} \times \cdots \times a_{k}]$ and $H$ is the hypercube $Q_{n}$ of dimension $n = \lceil \log_{2}(|V(G)|) \rceil$, the hypercube of smallest dimension having at least as many points as $G$. Our main result is that $B([a_{1} \times a_{2} \times \cdots \times a_{k}], Q_{n}) \le 3k$, provided $a_{i} \geq 2^{22}$ for each $1 \le i \le k$. For such $G$, the bound $3k$ improves on the previous best upper bound of $4k + O(1)$. Our methods include an application of Knuth's result on two-way rounding and of the existence of spanning regular cyclic caterpillars in the hypercube.
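
    The definitions above can be made concrete with a small worked example: compute dilation(f) for a hand-chosen one-to-one map of the 2 x 4 grid into Q_3, the hypercube of dimension ceil(log2(8)) = 3, using Hamming distance as the hypercube metric. The particular labelling below is an illustrative choice, unrelated to the paper's construction.

    # Hedged sketch: dilation of an explicit grid-to-hypercube embedding.
    from itertools import product

    def hamming(a, b):
        """Hypercube distance between two integer vertex labels."""
        return bin(a ^ b).count("1")

    def dilation(grid_dims, f):
        """Max hypercube distance between images of adjacent grid vertices."""
        worst = 0
        for v in product(*(range(d) for d in grid_dims)):
            for i in range(len(grid_dims)):
                if v[i] + 1 < grid_dims[i]:
                    w = v[:i] + (v[i] + 1,) + v[i + 1:]
                    worst = max(worst, hamming(f[v], f[w]))
        return worst

    # Map grid point (r, c) to the 3-bit label (r, gray(c)); grid neighbours
    # then sit at Hamming distance 1, so this embedding has dilation 1.
    gray = [0, 1, 3, 2]
    f = {(r, c): (r << 2) | gray[c] for r in range(2) for c in range(4)}
    print(dilation((2, 4), f))  # 1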